feat(synthetic-sf): add script to generate synthetic structure factors by DorisMai · Pull Request #234 · diff-use/sampleworks

DorisMai · 2026-05-14T23:04:12Z

Summary

Add generate_synthetic_sf.py to produce MTZ files of structure factors using SFcalculator (SFC), with optional bulk solvent + default scaling allowed as well as R-free flag generation. Assumes no anomalous scattering for now.
Mirror generate_synthetic_density.py in terms of supporting single_structure vs batch_csv generation, altloc occupancy overwrite, and selection and stripping of hydrogens/waters/ligands.
Refactor out assign_occupancies and selection/stripping from generate_density_sf.py into eval/synthetic_utils for reuse in the SF scripts.
SFC is instantiated via AtomArray though atomarray_to_gemmi. As this function is likely reused in prototyping the new reward function, test is added. Test is specific to SFC environment.

Test plan

SFC is kept as a separate dependency group, which results in a few new environments -- analysis-dev-sfc, boltz-dev-sfc, and rf3-dev-sfc, that eventually will be merged (or added to workflows).
❗ There is dependency conflict with Protenix (Issue Dependency conflict betwen sfcalculator-torch and protenix #235 ), hence no corresponding env.
All 3 new environments passed tests/eval/test_generate_synthetic_sf.py
Validate occupancy change effects visually in coot for 6B8X and 2YL0
Validate batch_csv mode works
Validate Ftotal and Fprotein values on 5SOP_chainA_altlocB with manual selection/stripping/overwrite in pymol --> pymol saved to cif --> SFC from cif

Next steps

Prototype new reward function, focus on single chain for now.
Test multi-chain synthetic generation, otherwise Ftotal does not make sense.
Estimate memory usage/constraints in SFC.

Summary by CodeRabbit

New Features
- CLI to generate synthetic structure-factor amplitudes with single-file or CSV batch modes, MTZ output, per-row overrides, and optional saving of processed structures.
Refactor
- Centralized structure loading, selection/stripping, altloc detection, and occupancy assignment into shared utilities.
Tests
- Added tests for conversion, occupancy validation/behavior, and scattering-factor computations.
Bug Fixes / Robustness
- Improved batch parsing and A3M handling; safer CSV column handling; clearer validation and error reporting for metrics and model checkpoint inputs.
Chores
- Project config updated with SFC-related dependency and a pandas version override.

coderabbitai · 2026-05-14T23:04:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds shared structure-loading and occupancy utilities, refactors density generation to use them, provides a new CLI to compute/write synthetic MTZs (single and batch), updates pyproject dependencies/rules, and adds tests validating conversion, occupancy handling, and scattering outputs.

Changes

Synthetic Structure Factor Generation with Refactored Utilities

Layer / File(s)	Summary
Configuration and dependency updates `pyproject.toml`	Adds `joblib` and `sfcalculator-torch>=0.3.2`, introduces a Pixi dependency-overrides pin for `pandas`, and reorders `[tool.ty.rules]`.
Core structure loading and occupancy utilities `src/sampleworks/eval/synthetic_utils.py`	New module with `validate_occupancy_values()`, `assign_occupancies()` (modes: `default`, `uniform`, `custom`), and `load_structure_for_synthetic_reward()` that loads, optionally selects/strips, detects altlocs, and assigns occupancies.
Refactor density generation to use shared loader `src/sampleworks/eval/generate_synthetic_density.py`	Delegates preprocessing and occupancy assignment to `load_structure_for_synthetic_reward(...)`, removes the local `assign_occupancies`, and adds a TODO about CUDA context contention for loky parallel jobs.
Structure-factor CLI and processing `src/sampleworks/eval/generate_synthetic_sf.py`	New CLI: `BatchRowForMTZ` parsing, `atomarray_to_gemmi` conversion, `process_amplitudes_to_dataset` MTZ construction (synthetic SIG, sigma casting, optional R-free), per-row `_process_single_row`, batch `process_batch` (joblib.loky), `load_batch_csv`, `parse_args`, and `main`. Supports occupancy and crystallography overrides and optional CIF saving.
Tests for conversion, occupancy, and scattering `tests/eval/test_generate_synthetic_sf.py`	New tests checking AtomArray→Gemmi conversion (cell, space group, atoms), occupancy propagation and validation (warnings/errors), and Fprotein magnitude consistency and sensitivity using SFC_Torch.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant BatchProcessor
  participant Loader as load_structure_for_synthetic_reward
  participant Converter as atomarray_to_gemmi
  participant SF as SFcalculator
  participant RS as reciprocalspaceship
  participant FS as Filesystem

  CLI->>BatchProcessor: parse CSV / CLI args
  BatchProcessor->>Loader: load file, apply selection/strips, assign occupancies
  Loader-->>BatchProcessor: AtomArray or None
  BatchProcessor->>Converter: convert AtomArray -> gemmi.Structure
  Converter->>SF: provide gemmi.Structure for scattering calc
  SF-->>RS: prepare_dataset (amplitudes, sigmas)
  RS->>FS: write MTZ output (optional CIF save)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

diff-use/sampleworks#75: Overlaps refactor around structure loading/stripping used by density generation.
diff-use/sampleworks#189: Related edits to tool.ty.rules reordering in pyproject.toml.

Suggested reviewers

k-chrispens
marcuscollins

"I hopped through pdb and tree,
occupancies sorted neatly,
factors hum in SFC-sky,
batch rows march on fleetly,
🐇✨"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 65.85% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(synthetic-sf): add script to generate synthetic structure factors' accurately describes the main change—adding a new script (generate_synthetic_sf.py) to produce MTZ files of structure factors, which is clearly the primary feature in this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dm/add-sfc-synthetic

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

src/sampleworks/eval/synthetic_utils.py (3)

158-158: 💤 Low value

Add explanatory comment for type ignore.

Per coding guidelines, type ignores should include explanatory comments. The # ty: ignore[invalid-argument-type] suppresses a type error but doesn't explain why it's safe.

♻️ Proposed fix

-    altloc_info = detect_altlocs(atom_array)  # ty: ignore[invalid-argument-type]
+    # ty: ignore[invalid-argument-type] - atom_array is AtomArray after stripping ops,
+    # detect_altlocs signature may be overly narrow
+    altloc_info = detect_altlocs(atom_array)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sampleworks/eval/synthetic_utils.py` at line 158, The call to
detect_altlocs(atom_array) uses a type ignore comment (# ty:
ignore[invalid-argument-type]) without justification; update the invocation in
synthetic_utils.py to keep the ignore but add a short explanatory comment after
it explaining why the type error is safe (e.g., atom_array is known at runtime
to match the expected type/has required attributes or originates from a
validated source) and reference the specific symbols detect_altlocs and
atom_array so future readers/linters understand the rationale.

81-91: ⚡ Quick win

Extra occ_values are silently ignored.

When len(occ_values) > len(altloc_info.altloc_ids), extra values are silently discarded without warning. This asymmetry with the "too few values" case (which warns) may surprise callers.

Consider adding a warning for the excess values case:

♻️ Proposed fix

         if len(occ_values) != len(altloc_info.altloc_ids):
+            if len(occ_values) > len(altloc_info.altloc_ids):
+                logger.warning(
+                    f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. "
+                    "Extra values will be ignored."
+                )
+                occ_values = occ_values[:len(altloc_info.altloc_ids)]
+            else:
-            logger.warning(
-                f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. "
-                "The missing values are automatically set to 0."
-            )
-            occ_values = occ_values + [0.0] * (len(altloc_info.altloc_ids) - len(occ_values))
+                logger.warning(
+                    f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. "
+                    "The missing values are automatically set to 0."
+                )
+                occ_values = occ_values + [0.0] * (len(altloc_info.altloc_ids) - len(occ_values))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sampleworks/eval/synthetic_utils.py` around lines 81 - 91, The code
currently only warns when occ_values is shorter than altloc_info.altloc_ids but
silently ignores extra occ_values; update the logic around occ_values handling
(before the for loop that zips sorted(altloc_info.altloc_ids) with occ_values)
to detect when len(occ_values) > len(altloc_info.altloc_ids) and emit a
logger.warning indicating how many extra occupancy values were provided and that
they will be ignored, then trim occ_values to the expected length; keep the
existing branch that pads with zeros when occ_values is shorter and continue
assigning occupancy using occupancy[altloc_info.atom_masks[altloc]] in the
existing for altloc, occ loop so behavior remains consistent.

91-91: 💤 Low value

Return type cast may be incorrect for AtomArrayStack inputs.

The function signature accepts AtomArray | AtomArrayStack, but cast(AtomArray, result) always asserts AtomArray. If an AtomArrayStack is passed, this cast is misleading to type checkers and callers.

If the function genuinely only handles AtomArray, narrow the signature. Otherwise, remove the cast or make it conditional.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sampleworks/eval/synthetic_utils.py` at line 91, The return cast to
AtomArray is incorrect when the function accepts AtomArray | AtomArrayStack:
update the function (in synthetic_utils.py) so it either narrows the signature
to AtomArray if it truly never returns a stack, or remove the unconditional cast
and return result with its original type; alternatively, if both are valid,
return a union type or perform a runtime check (e.g., isinstance(result,
AtomArrayStack)) and cast/annotate conditionally so callers and type checkers
see the correct AtomArray vs AtomArrayStack type instead of always using
cast(AtomArray, result).

tests/eval/test_generate_synthetic_sf.py (1)

39-44: 💤 Low value

Add return type annotation for consistency.

The stripped_atom_array fixture is missing a return type annotation, unlike the other fixtures.

Suggested fix

+from biotite.structure import AtomArray
+
 `@pytest.fixture`(scope="module")
-def stripped_atom_array(resources_dir: Path):
+def stripped_atom_array(resources_dir: Path) -> AtomArray:
     arr = load_structure_with_altlocs(resources_dir / "6b8x" / "6b8x_final.pdb")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_generate_synthetic_sf.py` around lines 39 - 44,
stripped_atom_array fixture is missing a return type; update its signature to
include the same return type used by the other fixtures (e.g. def
stripped_atom_array(resources_dir: Path) -> AtomArray:) and add the appropriate
type import if not present. Keep the body unchanged (calls to
load_structure_with_altlocs, remove_hydrogens, remove_waters, keep_amino_acids,
keep_polymer) but ensure the declared return type matches the actual returned
value.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/sampleworks/eval/generate_synthetic_sf.py`:
- Around line 442-460: The Parallel invocation may spawn multiple loky workers
that each initialize CUDA if device is a GPU, causing GPU contention; update the
code around the Parallel(...) call to detect GPU devices (check the passed-in
device variable for values like "cuda" or starting with "cuda:") and if a GPU is
detected and n_jobs > 1 either set n_jobs = 1 (safe default) or emit a clear
warning (e.g., via logging or warnings.warn) and fall back to n_jobs = 1; ensure
this logic is applied before calling Parallel and reference the Parallel(...)
call and the _process_single_row invocation so the change prevents multiple
worker processes from initializing separate CUDA contexts.

In `@tests/eval/test_generate_synthetic_sf.py`:
- Around line 100-122: These two GPU-dependent tests
(test_fprotein_matches_direct_gemmi and test_fprotein_changes_with_occupancy)
should be annotated with the pytest slow marker: add import pytest if missing
and place `@pytest.mark.slow` immediately above each test function definition that
uses the device fixture/SFcalculator so they are excluded from fast CI runs.
Ensure you only add the decorator (no other behavioral changes) above the def
lines for test_fprotein_matches_direct_gemmi and
test_fprotein_changes_with_occupancy.

---

Nitpick comments:
In `@src/sampleworks/eval/synthetic_utils.py`:
- Line 158: The call to detect_altlocs(atom_array) uses a type ignore comment (#
ty: ignore[invalid-argument-type]) without justification; update the invocation
in synthetic_utils.py to keep the ignore but add a short explanatory comment
after it explaining why the type error is safe (e.g., atom_array is known at
runtime to match the expected type/has required attributes or originates from a
validated source) and reference the specific symbols detect_altlocs and
atom_array so future readers/linters understand the rationale.
- Around line 81-91: The code currently only warns when occ_values is shorter
than altloc_info.altloc_ids but silently ignores extra occ_values; update the
logic around occ_values handling (before the for loop that zips
sorted(altloc_info.altloc_ids) with occ_values) to detect when len(occ_values) >
len(altloc_info.altloc_ids) and emit a logger.warning indicating how many extra
occupancy values were provided and that they will be ignored, then trim
occ_values to the expected length; keep the existing branch that pads with zeros
when occ_values is shorter and continue assigning occupancy using
occupancy[altloc_info.atom_masks[altloc]] in the existing for altloc, occ loop
so behavior remains consistent.
- Line 91: The return cast to AtomArray is incorrect when the function accepts
AtomArray | AtomArrayStack: update the function (in synthetic_utils.py) so it
either narrows the signature to AtomArray if it truly never returns a stack, or
remove the unconditional cast and return result with its original type;
alternatively, if both are valid, return a union type or perform a runtime check
(e.g., isinstance(result, AtomArrayStack)) and cast/annotate conditionally so
callers and type checkers see the correct AtomArray vs AtomArrayStack type
instead of always using cast(AtomArray, result).

In `@tests/eval/test_generate_synthetic_sf.py`:
- Around line 39-44: stripped_atom_array fixture is missing a return type;
update its signature to include the same return type used by the other fixtures
(e.g. def stripped_atom_array(resources_dir: Path) -> AtomArray:) and add the
appropriate type import if not present. Keep the body unchanged (calls to
load_structure_with_altlocs, remove_hydrogens, remove_waters, keep_amino_acids,
keep_polymer) but ensure the declared return type matches the actual returned
value.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 724f9f85-5506-461e-aa16-c524bebf8acf

📥 Commits

Reviewing files that changed from the base of the PR and between fbf6d38 and f07cca8.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

pyproject.toml
src/sampleworks/eval/generate_synthetic_density.py
src/sampleworks/eval/generate_synthetic_sf.py
src/sampleworks/eval/synthetic_utils.py
tests/eval/test_generate_synthetic_sf.py

marcuscollins

Please have a look at the comments. I don't see anything major but there are some improvements to make. Also, there are merge conflicts that need to be resolved.

marcuscollins · 2026-05-26T16:23:53Z

 [tool.pixi.environments]
 analysis = {features = ["analysis"]}
 analysis-dev = {features = ["analysis", "dev"]}
+analysis-dev-sfc = {features = ["analysis", "dev", "sfc"]}


Is there a reason not to just make sfc part of the core requirements (I recall maybe there were dependency conflicts?) At some point we will want to use sfc more generally, right?

yes sfc conflicts with protenix, but Minhuan found a workaround by overriding the pandas dependency (#235), I am adding the fix now. Nevertheless, I am wondering if we should add reward-related dependencies all together in the core requirements. In the future we might have more of dependencies like SFC for other experimental modalities, but users might not care all experimental modalities, and their dependencies might conflict each other. I thought a more composable environment would be safer, though less convenient. @marcuscollins lmk your thoughts

@DorisMai this is always a thorny issue. One approach is to push others not to pin dependencies down to the sub minor version level. It is almost never actually necessary but there seems to be an increasing trend to do this. I think we should discuss with the engineers what are the possible approaches to additional experimental modalities. It would be nice if we could organize this around added features, like pip does, rather than ending up with a Cartesian product of pixi environments.

For now, let's try to keep reciprocal space data in the core dependencies.

coderabbitai

🧹 Nitpick comments (3)

src/sampleworks/eval/generate_synthetic_sf.py (3)

491-608: 💤 Low value

Add docstring to parse_args.

Per coding guidelines, all functions should have NumPy-style docstrings.

 def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments for synthetic structure factor generation.
+
+    Returns
+    -------
+    argparse.Namespace
+        Parsed command-line arguments
+    """
     parser = argparse.ArgumentParser(

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 491 - 608, The
parse_args function lacks a NumPy-style docstring; add a concise docstring
immediately under the def parse_args() line describing purpose, parameters
(none), returns (argparse.Namespace) and any important behaviour (flags/groups
like selection, occupancy-mode, simulate-solvent-and-scale, output vs
output-dir, batch vs single-structure modes), following NumPy docstring
conventions so tools and readers can understand the CLI interface; ensure the
docstring mentions default behaviors and side effects (e.g., base-dir used only
in batch mode) and stays brief.

179-212: 💤 Low value

Add missing Returns section to docstring.

The function signature specifies -> rs.DataSet but the docstring lacks a corresponding Returns section. Per coding guidelines, NumPy-style docstrings should document return values.

     sigma_f_scale: float
         Scale factor to make a fake sigma column from the structure factor amplitude
         values so that SFcalculator can load the output MTZ file as synthetic reward
         without errors. The actual sigma values only matter when computing R-factor.
     output_path: Path | None
         If provided, write the dataset to this MTZ file path.
+
+    Returns
+    -------
+    rs.DataSet
+        Dataset with structure factor amplitudes, synthetic sigma column, and
+        optional R-free flags.
     """

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 179 - 212, Add a
NumPy-style "Returns" section to the docstring of process_amplitudes_to_dataset
describing that it returns an rs.DataSet containing the processed reflections
(with columns for miller indices, structure factor amplitudes, generated sigma
column scaled by sigma_f_scale, and R-free flags), and note that if output_path
is provided the dataset is also written to an MTZ; reference the function name
process_amplitudes_to_dataset and the return type rs.DataSet in the description
and briefly mention any conditions (e.g., test_fraction=0 disables R-free test
set).

611-663: 💤 Low value

Add docstring to main.

Per coding guidelines, all functions should have NumPy-style docstrings.

 def main() -> None:
+    """Entry point for synthetic structure factor generation CLI.
+
+    Supports two modes:
+    - Single structure: Generate MTZ for one input file
+    - Batch: Process multiple structures from a CSV file
+    """
     args = parse_args()

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 611 - 663, Add a
NumPy-style docstring to the main() function describing purpose, parameters
(none taken directly but mention CLI args from parse_args()), return value
(None), and side effects (calls parse_args(), try_gpu(), process_batch(),
_process_single_row(), exits on error, writes outputs), and include brief
examples or usage notes if appropriate; ensure the docstring references behavior
such as handling args.batch_csv versus args.structure, creation of
BatchRowForMTZ, and that it may call process_batch(...) or
_process_single_row(...) and call sys.exit(1) on missing inputs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/sampleworks/eval/generate_synthetic_sf.py`:
- Around line 491-608: The parse_args function lacks a NumPy-style docstring;
add a concise docstring immediately under the def parse_args() line describing
purpose, parameters (none), returns (argparse.Namespace) and any important
behaviour (flags/groups like selection, occupancy-mode,
simulate-solvent-and-scale, output vs output-dir, batch vs single-structure
modes), following NumPy docstring conventions so tools and readers can
understand the CLI interface; ensure the docstring mentions default behaviors
and side effects (e.g., base-dir used only in batch mode) and stays brief.
- Around line 179-212: Add a NumPy-style "Returns" section to the docstring of
process_amplitudes_to_dataset describing that it returns an rs.DataSet
containing the processed reflections (with columns for miller indices, structure
factor amplitudes, generated sigma column scaled by sigma_f_scale, and R-free
flags), and note that if output_path is provided the dataset is also written to
an MTZ; reference the function name process_amplitudes_to_dataset and the return
type rs.DataSet in the description and briefly mention any conditions (e.g.,
test_fraction=0 disables R-free test set).
- Around line 611-663: Add a NumPy-style docstring to the main() function
describing purpose, parameters (none taken directly but mention CLI args from
parse_args()), return value (None), and side effects (calls parse_args(),
try_gpu(), process_batch(), _process_single_row(), exits on error, writes
outputs), and include brief examples or usage notes if appropriate; ensure the
docstring references behavior such as handling args.batch_csv versus
args.structure, creation of BatchRowForMTZ, and that it may call
process_batch(...) or _process_single_row(...) and call sys.exit(1) on missing
inputs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e61e2df-c13c-4ee7-ac07-954e20a9b9d8

📥 Commits

Reviewing files that changed from the base of the PR and between f07cca8 and 7207a4e.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

pyproject.toml
src/sampleworks/eval/generate_synthetic_density.py
src/sampleworks/eval/generate_synthetic_sf.py
src/sampleworks/eval/synthetic_utils.py
tests/eval/test_generate_synthetic_sf.py

🚧 Files skipped from review as they are similar to previous changes (4)

src/sampleworks/eval/synthetic_utils.py
tests/eval/test_generate_synthetic_sf.py
pyproject.toml
src/sampleworks/eval/generate_synthetic_density.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/eval/test_generate_synthetic_sf.py`:
- Around line 57-68: Add a NumPy-style docstring to the helper function
_compute_fprotein describing its purpose, parameters and return value;
specifically document that _compute_fprotein(gemmi_structure: gemmi.Structure,
device: torch.device) computes the Fprotein_asu using SFcalculator (constructed
with PDBParser(gemmi_structure), mtzdata=None, dmin=DMIN, mode="xray",
anomalous=False, set_experiment=False, device=device) and returns a
numpy.ndarray; include param entries for gemmi_structure and device with their
types and a Returns section stating it returns np.ndarray of computed
Fprotein_asu and any side effects (calls sfc.calc_fprotein()) in NumPy docstring
format placed immediately below the def line for _compute_fprotein.
- Around line 71-148: The TestAtomArrayToGemmi class lacks a NumPy-style
docstring; add a concise NumPy-style docstring immediately above the class
TestAtomArrayToGemmi declaration that summarizes the purpose (validating
atomarray→gemmi conversion) and briefly describes the grouped tests
(cell/spacegroup preservation, atom fidelity, occupancy handling, and
structure-factor computation) so it complies with the project docstring
guidelines.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e6b7783-7bb2-46cc-b7a2-aa7fca7324ce

📥 Commits

Reviewing files that changed from the base of the PR and between 7207a4e and 0dc3acb.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

pyproject.toml
src/sampleworks/eval/generate_synthetic_density.py
src/sampleworks/eval/generate_synthetic_sf.py
src/sampleworks/eval/synthetic_utils.py
tests/eval/test_generate_synthetic_sf.py

🚧 Files skipped from review as they are similar to previous changes (4)

src/sampleworks/eval/generate_synthetic_density.py
pyproject.toml
src/sampleworks/eval/synthetic_utils.py
src/sampleworks/eval/generate_synthetic_sf.py

marcuscollins

Please go ahead and fix as much of the ty warnings/errors as you can easily, make an issue for the rest (link to the log file on GitHub) and assign to me for further triage.

I did add a comment on joblib--it should be an explicit dependency in the analysis environments. I thought it was, but maybe it has been accidentally removed.

One last thing: please make sure you have docstrings everywhere, including tests.

I'll approve and leave it up to you to judge when these last few things are done.

marcuscollins · 2026-05-28T16:20:30Z

+    save_structure
+        If True, save each processed structure as mmCIF to output_dir.
+    """
+    from joblib import delayed, Parallel


At some level that's fine as long as the code never gets called except in the analysis environment, right? Is the issue that it raises an import error in type checking? Or are there tests that fail in certain environments? If so, there are ways to mark the tests to only run under certain conditions.

In any case, we should include joblib as an explicit dependency, at least in the analysis environments. I thought I had added it, please add it back in if it isn't there.

But I don't feel strongly about it. The case you describe is one of the (few) reasons for inline imports like this.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/eval/test_generate_synthetic_sf.py (1)

56-67: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add NumPy-style docstring to _compute_fprotein.

This helper function is missing a docstring. As per coding guidelines, every function must include a NumPy-style docstring documenting its parameters, return value, and purpose.

📝 Suggested docstring

 def _compute_fprotein(gemmi_structure: gemmi.Structure, device: torch.device) -> np.ndarray:
+    """Compute structure factors (Fprotein) from a Gemmi structure.
+
+    Parameters
+    ----------
+    gemmi_structure : gemmi.Structure
+        Input crystallographic structure.
+    device : torch.device
+        Device for SFcalculator computation.
+
+    Returns
+    -------
+    np.ndarray
+        Complex structure factors in ASU, shape (n_reflections,).
+    """
     sfc = SFcalculator(

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_generate_synthetic_sf.py` around lines 56 - 67, The helper
function _compute_fprotein is missing a NumPy-style docstring; add a concise
NumPy-style docstring at the top of _compute_fprotein describing its purpose
(compute Fprotein for an ASU using SFcalculator), the parameters
(gemmi_structure: gemmi.Structure, device: torch.device), and the return value
(np.ndarray containing Fprotein_asu), and mention any side effects or
expectations (uses PDBParser and SFcalculator, assumes DMIN and returns the
result of sfc.Fprotein_asu).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/eval/test_generate_synthetic_sf.py`:
- Around line 56-67: The helper function _compute_fprotein is missing a
NumPy-style docstring; add a concise NumPy-style docstring at the top of
_compute_fprotein describing its purpose (compute Fprotein for an ASU using
SFcalculator), the parameters (gemmi_structure: gemmi.Structure, device:
torch.device), and the return value (np.ndarray containing Fprotein_asu), and
mention any side effects or expectations (uses PDBParser and SFcalculator,
assumes DMIN and returns the result of sfc.Fprotein_asu).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5faf1e96-b051-43e1-a290-cfdcdc94798e

📥 Commits

Reviewing files that changed from the base of the PR and between 41b9348 and ba29c9e.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (11)

pyproject.toml
scripts/eval/classify_altloc_regions.py
src/sampleworks/eval/generate_synthetic_sf.py
src/sampleworks/eval/synthetic_utils.py
src/sampleworks/metrics/metric.py
src/sampleworks/utils/guidance_script_utils.py
src/sampleworks/utils/mmseqs2.py
tests/cli/test_guidance_cli.py
tests/eval/test_generate_synthetic_sf.py
tests/rewards/test_real_space_density_reward.py
tests/utils/test_framework_utils.py

✅ Files skipped from review due to trivial changes (2)

scripts/eval/classify_altloc_regions.py
tests/rewards/test_real_space_density_reward.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/sampleworks/eval/synthetic_utils.py
src/sampleworks/eval/generate_synthetic_sf.py

…ng gemmi structure.

…ncies function

…m utils instead

…rd into synthetic_utils

…etic sf script

…nment, document GPU memory issue in both synthetic scripts

…d tests docstring

…b to analysis dependency

Narrow `assign_occupancies` to `AtomArray -> AtomArray` and replace two `cast(AtomArray, ...)` calls in `synthetic_utils.py` with a single `assert isinstance(atom_array, AtomArray)` after the broadening transforms. Drop a now-redundant `# ty: ignore[invalid-argument-type]` on `detect_altlocs`. Add `import reciprocalspaceship.utils` so ty resolves `rs.utils.add_rfree`. Apply the same narrowing pattern to the `stripped_atom_array` test fixture. Resolves 4 ty warnings in the protenix-dev typecheck. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

DorisMai had a problem deploying to gpu-testing May 14, 2026 23:04 — with GitHub Actions Error

DorisMai had a problem deploying to gpu-testing May 15, 2026 23:33 — with GitHub Actions Error

DorisMai marked this pull request as ready for review May 15, 2026 23:58

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

Comment thread src/sampleworks/eval/generate_synthetic_sf.py

Comment thread tests/eval/test_generate_synthetic_sf.py Outdated

DorisMai mentioned this pull request May 16, 2026

Dependency conflict betwen sfcalculator-torch and protenix #235

Closed

DorisMai requested a review from marcuscollins May 16, 2026 00:16

marcuscollins requested changes May 26, 2026

View reviewed changes

This was referenced May 26, 2026

feat(synthetic-sf): handle GPU memory contention in parallel batch processing #241

Closed

GPU memory contention when generating synthetic data with n_jobs>1 on CUDA #242

Open

📝 CodeRabbit Chat: Implement requested code changes #243

Closed

DorisMai force-pushed the dm/add-sfc-synthetic branch from f07cca8 to 7207a4e Compare May 27, 2026 02:05

DorisMai had a problem deploying to gpu-testing May 27, 2026 02:05 — with GitHub Actions Error

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

DorisMai force-pushed the dm/add-sfc-synthetic branch from 7207a4e to 0dc3acb Compare May 27, 2026 21:58

DorisMai had a problem deploying to gpu-testing May 27, 2026 21:58 — with GitHub Actions Error

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread tests/eval/test_generate_synthetic_sf.py

Comment thread tests/eval/test_generate_synthetic_sf.py

DorisMai temporarily deployed to gpu-testing May 27, 2026 23:35 — with GitHub Actions Inactive

DorisMai had a problem deploying to gpu-testing May 27, 2026 23:35 — with GitHub Actions Error

DorisMai temporarily deployed to gpu-testing May 27, 2026 23:35 — with GitHub Actions Inactive

DorisMai requested a review from marcuscollins May 27, 2026 23:41

marcuscollins approved these changes May 28, 2026

View reviewed changes

DorisMai had a problem deploying to gpu-testing May 29, 2026 00:46 — with GitHub Actions Error

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

DorisMai temporarily deployed to gpu-testing May 29, 2026 01:01 — with GitHub Actions Inactive

DorisMai had a problem deploying to gpu-testing May 29, 2026 01:01 — with GitHub Actions Error

DorisMai temporarily deployed to gpu-testing May 29, 2026 01:01 — with GitHub Actions Inactive

This was referenced May 29, 2026

Typecheck error in protenix-dev #247

Open

Update generate_synthetic_density.py naming and validation to be consistent with generate_synthetic_sf.py #248

Open

Typecheck warning #249

Open

DorisMai and others added 14 commits June 12, 2026 10:13

feat: generate synthetic sf for one row with occupancy changeable usi…

ecb59fa

…ng gemmi structure.

feat: generate synthetic sf using atomarray. refactored assign_occupa…

8373ad7

…ncies function

feat: support batch mode generation of synthetic sf

d878125

refactor(density): remove duplicate assign_occupancies and import fro…

08b1eed

…m utils instead

test(sfc): add atomarray_to_gemmi tests using 6b8x

c9f9907

refactor: extract shared structure loading in generate synthetic rewa…

b0280be

…rd into synthetic_utils

refactor(sfc): improve naming, documentation, and validation in synth…

365b1cc

…etic sf script

test(sfc): add more tests on atomarray_to_gemmi and occupancy reassig…

480d383

…nment, document GPU memory issue in both synthetic scripts

build(deps): add protenix-dev-sfc env and relax setuptools pin and ad…

4d81fe0

…d tests docstring

build(deps): promote sfcalculator-torch to core dependency, add jobli…

88be517

…b to analysis dependency

chore(types): fix easy ty warnings shared across boltz/protenix/rf3 envs

e9fefb3

chore(doc): add docstring in tests

baedd40

fix(msa): normalize protenix MSA paths to Path in get_msa

c75d08b

DorisMai force-pushed the dm/add-sfc-synthetic branch from dbc7af1 to c75d08b Compare June 12, 2026 20:02

DorisMai requested a deployment to gpu-testing June 12, 2026 20:02 — with GitHub Actions Waiting

DorisMai merged commit 0fb4f93 into main Jun 12, 2026
8 of 11 checks passed

Conversation

DorisMai commented May 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Next steps

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

marcuscollins left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marcuscollins May 26, 2026

Choose a reason for hiding this comment

Uh oh!

DorisMai May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marcuscollins May 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

marcuscollins left a comment

Choose a reason for hiding this comment

Uh oh!

marcuscollins May 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DorisMai commented May 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading

DorisMai May 27, 2026 •

edited

Loading